Pre Processing Techniques for Arabic Documents Clustering
نویسنده
چکیده
Clustering of text documents is an important technique for documents retrieval. It aims to organize documents into meaningful groups or clusters. Preprocessing text plays a main role in enhancing clustering process of Arabic documents. This research examines and compares text preprocessing techniques in Arabic document clustering. It also studies effectiveness of text preprocessing techniques: term pruning, term weighting using (TF-IDF), morphological analysis techniques using (root-based stemming, light stemming, and raw text), and normalization. Experimental work examined the effect of clustering algorithms using a most widely used partitional algorithm, K-means, compared with other clustering partitional algorithm, Expectation Maximization (EM) algorithm. Comparison between the effect of both Euclidean Distance and Manhattan similarity measurement function was attempted in order to produce best results in document clustering. Results were investigated by measuring evaluation of clustered documents in many cases of preprocessing techniques. Experimental results show that evaluation of document clustering can be enhanced by implementing term weighting (TF-IDF) and term pruning with small value for minimum term frequency. In morphological analysis, light stemming, is found more appropriate than root-based stemming and raw text. Normalization, also improved clustering process of Arabic documents, and evaluation is
منابع مشابه
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملBig Data Categorization for Arabic Text Using Latent Semantic Indexing and Clustering
Documents categorization is an important field in the area of natural language processing. In this paper, we propose using Latent Semantic Indexing (LSI), singular value decomposing (SVD) method, and clustering techniques to group similar unlabeled document into pre-specified number of topics. The generated groups are then categorized using a suitable label. For clustering, we used Expectation–...
متن کاملRotation and Scale Invariant Feature Extraction Using Complex Zernike Moments Forfarsiand Arabic Handwriting Character
Analyzing Farsi and Arabic handwritten documents is one area in image processing whose target is to transform picture documents into symbolic form. This transformation is conducted o make rapid and easy saving, improvements, retrieval, reuse, searching and transferring documents. Analyzing documents is performed in five stages: pre-processing, segmentation representation, recognition and post-p...
متن کاملخوشهبندی اسناد مبتنی بر آنتولوژی و رویکرد فازی
Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...
متن کاملUsing Clustering Techniques for on-segmented Language Document Management: A Comparison of K-mean and Self Organizing Map Techniques
Since the number of electronics non-segmented language documents is growing very fast, efficient document clustering techniques for non-segmented languages are needed as a tool in today’s world where a lot of documents are stored and retrieved electronically. It enables one to group the similar documents using keywords or terms of the clusters. Thus document clustering can be used to group and ...
متن کامل